
Visual Odometry



Bias-Eliminated PnP for Stereo Visual Odometry: Provably Consistent and Large-Scale Localization

Zeng, Guangyang, Shen, Yuan, Hong, Ziyang, Hong, Yuze, Ila, Viorela, Shi, Guodong, Wu, Junfeng

arXiv.org Artificial Intelligence

Abstract--In this paper, we first present a bias-eliminated weighted (Bias-Eli-W) perspective-n-point (PnP) estimator for stereo visual odometry (VO) with provable consistency. Specifically, leveraging statistical theory, we develop an asymptotically unbiased and √n-consistent PnP estimator that accounts for varying 3D triangulation uncertainties, ensuring that the relative pose estimate converges to the ground truth as the number of features increases. Next, on the stereo VO pipeline side, we propose a framework that continuously triangulates contemporary features for tracking new frames, effectively decoupling temporal dependencies between pose and 3D point errors. We integrate the Bias-Eli-W PnP estimator into the proposed stereo VO pipeline, creating a synergistic effect that enhances the suppression of pose estimation errors. Experimental results demonstrate that our method: 1) achieves significant improvements in both relative pose error and absolute trajectory error in large-scale environments; 2) provides reliable localization under erratic and unpredictable robot motions. The successful implementation of the Bias-Eli-W PnP in stereo VO indicates the importance of information screening in robotic estimation tasks with high-uncertainty measurements, shedding light on diverse applications where PnP is a key ingredient.

Index Terms--Stereo visual odometry, PnP pose estimation, large-scale localization, consistent estimator.

Visual odometry (VO) refers to estimating the pose of a moving camera in 3D space from sequential images captured by the camera. The significance of VO stems from its advantages of being infrastructure-free, cost-effective, lightweight, energy-efficient, etc. [1, 2, 3]. It enables robots to perceive and navigate their environment autonomously. Compared with monocular VO, stereo VO offers several advantages, such as scale consistency, better accuracy, and enhanced robustness, due to its ability to perceive depth directly [4, 5]. Existing VO methods typically optimize both camera poses and 3D map points simultaneously, with the map being used to track new frames through the perspective-n-point (PnP) algorithm [1, 6, 4].
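To make the role of per-feature uncertainty concrete, the following minimal Python sketch solves a pose from 3D-2D correspondences by minimizing a reprojection error in which each point is down-weighted according to an assumed depth-dependent triangulation uncertainty. The 1/depth^2 weighting rule and the generic least-squares solver are illustrative placeholders, not the Bias-Eli-W estimator described in the paper.

```python
# Minimal sketch of a weighted PnP solve: minimize per-point-weighted reprojection
# error, with weights derived from an assumed stereo triangulation uncertainty.
# This is NOT the paper's Bias-Eli-W estimator; it only illustrates why weighting
# high-uncertainty 3D points differently matters in a PnP-based VO front end.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation as R

def project(K, rvec, t, X):
    """Project Nx3 world points X into the image with pose (rvec, t) and intrinsics K."""
    Xc = R.from_rotvec(rvec).apply(X) + t          # world -> camera frame
    uv = (K @ Xc.T).T                              # pinhole projection
    return uv[:, :2] / uv[:, 2:3]

def weighted_pnp(K, X, uv_obs, depths, x0=None):
    """Estimate (rvec, t) from 3D points X and 2D observations uv_obs.
    Assumption: triangulation error grows roughly with depth^2 for a stereo rig,
    so each residual is down-weighted by 1/depth^2 (an illustrative heuristic)."""
    w = 1.0 / np.maximum(depths, 1e-6) ** 2
    w = np.sqrt(w / w.max())                       # residual scaling = sqrt of weights

    def residual(params):
        rvec, t = params[:3], params[3:]
        return ((project(K, rvec, t, X) - uv_obs) * w[:, None]).ravel()

    x0 = np.zeros(6) if x0 is None else x0
    sol = least_squares(residual, x0, method="lm")
    return sol.x[:3], sol.x[3:]
```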


BEV-ODOM2: Enhanced BEV-based Monocular Visual Odometry with PV-BEV Fusion and Dense Flow Supervision for Ground Robots

Wei, Yufei, Lu, Wangtao, Lu, Sha, Hu, Chenxiao, Han, Fuzhang, Xiong, Rong, Wang, Yue

arXiv.org Artificial Intelligence

Abstract--Bird's-Eye-View (BEV) representation offers a metric-scaled planar workspace, facilitating the simplification of 6-DoF ego-motion to a more robust 3-DoF model for monocular visual odometry (MVO) in intelligent transportation systems. However, existing BEV methods suffer from sparse supervision signals and information loss during perspective-to-BEV projection. Our approach introduces: (1) dense BEV optical flow supervision constructed from 3-DoF pose ground truth for pixel-level guidance; (2) PV-BEV fusion that computes correlation volumes before projection to preserve 6-DoF motion cues while maintaining scale consistency. The framework employs three supervision levels derived solely from pose data: dense BEV flow, 5-DoF for the PV branch, and the final 3-DoF output. Extensive evaluation on KITTI, NCLT, Oxford, and our newly collected ZJH-VO multi-scale dataset demonstrates state-of-the-art performance, achieving a 40% improvement in RTE compared to previous BEV methods. The ZJH-VO dataset, covering diverse ground vehicle scenarios from underground parking to outdoor plazas, is publicly available to facilitate future research.

Bird's-Eye-View (BEV) representation has become a cornerstone for perception and localization tasks in modern intelligent transportation systems [1]-[3], offering a powerful solution to the scale drift problem inherent in Monocular Visual Odometry (MVO) [4], [5]. For ground vehicles like autonomous cars and logistics robots, motion is predominantly planar [6]. This allows pose estimation to be simplified from six degrees of freedom (6-DoF) to a more robust 3-DoF model (x, y, yaw), which naturally aligns with the unified, metric-scaled grid of the BEV representation [7]. This simplification not only reduces computational complexity but also mitigates the accumulation of errors in non-primary motion axes, a common source of drift in long-range navigation.
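As a point of reference for the 3-DoF simplification, the sketch below recovers (x, y, yaw) motion from matched points in two metric-scaled BEV grids with a closed-form 2D rigid alignment. The paper instead regresses this motion from dense BEV flow, so this is only an illustration of the underlying planar geometry.

```python
# Minimal sketch: recover a 3-DoF (x, y, yaw) motion from matched points in two
# metric-scaled BEV grids via closed-form 2D rigid alignment (Kabsch without scale).
# Illustrative only; BEV-ODOM2 learns this motion rather than solving it in closed form.
import numpy as np

def rigid_2d(src, dst):
    """src, dst: Nx2 matched BEV coordinates (meters). Returns yaw (rad) and t (2,)
    such that dst ~= R(yaw) @ src + t."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)              # 2x2 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    Rm = Vt.T @ U.T
    if np.linalg.det(Rm) < 0:                      # enforce a proper rotation
        Vt[1] *= -1
        Rm = Vt.T @ U.T
    yaw = np.arctan2(Rm[1, 0], Rm[0, 0])
    t = mu_d - Rm @ mu_s
    return yaw, t
```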


Deep Visual Odometry for Stereo Event Cameras

Zhong, Sheng, Niu, Junkai, Zhou, Yi

arXiv.org Artificial Intelligence

Event-based cameras are bio-inspired sensors with pixels that independently and asynchronously respond to brightness changes at microsecond resolution, offering the potential to handle state estimation tasks involving motion blur and high dynamic range (HDR) illumination conditions. However, the versatility of event-based visual odometry (VO) relying on handcrafted data association (either direct or indirect methods) is still unreliable, especially in field robot applications under low-light HDR conditions, where the dynamic range can be enormous and the signal-to-noise ratio is spatially and temporally varying. Leveraging deep neural networks offers new possibilities for overcoming these challenges. In this paper, we propose a learning-based stereo event visual odometry. Building upon Deep Event Visual Odometry (DEVO), our system (called Stereo-DEVO) introduces a novel and efficient static-stereo association strategy for sparse depth estimation with almost no additional computational burden. By integrating it into a tightly coupled bundle adjustment (BA) optimization scheme, and benefiting from the recurrent network's ability to perform accurate optical flow estimation through voxel-based event representations to establish reliable patch associations, our system achieves high-precision pose estimation in metric scale. In contrast to the offline performance of DEVO, our system can process event data of Video Graphics Array (VGA) resolution in real time. Extensive evaluations on multiple public real-world datasets and self-collected data justify our system's versatility, demonstrating superior performance compared to state-of-the-art event-based VO methods. More importantly, our system achieves stable pose estimation even in large-scale nighttime HDR scenarios.
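For context on what a static-stereo association ultimately provides, the short sketch below converts a disparity map from a rectified stereo pair into metric depth via z = f·b/d; the focal length, baseline, and validity threshold are placeholder values, not parameters from the paper's setup, and the paper's learned association pipeline is of course far richer.

```python
# Minimal sketch: depth from disparity in a rectified stereo rig, the geometric
# relation any static-stereo association ultimately relies on (z = f * b / d).
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m, min_disp=0.5):
    """Convert a disparity map (pixels) to metric depth; small/invalid disparities
    are masked out as NaN rather than producing unbounded depths."""
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full_like(disparity_px, np.nan)
    valid = disparity_px > min_disp
    depth[valid] = focal_px * baseline_m / disparity_px[valid]
    return depth
```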


Good Deep Features to Track: Self-Supervised Feature Extraction and Tracking in Visual Odometry

Gottam, Sai Puneeth Reddy, Zhang, Haoming, Keras, Eivydas

arXiv.org Artificial Intelligence

Abstract--Vision-based localization has made significant progress, yet its performance often drops in large-scale, outdoor, and long-term settings due to factors like lighting changes, dynamic scenes, and low-texture areas. These challenges degrade feature extraction and tracking, which are critical for accurate motion estimation. While learning-based methods such as SuperPoint and SuperGlue show improved feature coverage and robustness, they still face generalization issues with out-of-distribution data. We address this by enhancing deep feature extraction and tracking through self-supervised learning with task-specific feedback. Our method promotes stable and informative features, improving generalization and reliability in challenging environments.


Odometry Calibration and Pose Estimation of a 4WIS4WID Mobile Wall Climbing Robot

Ćaran, Branimir, Milić, Vladimir, Švaco, Marko, Jerbić, Bojan

arXiv.org Artificial Intelligence

Abstract--This paper presents the design of a pose estimator for a four-wheel independent steer, four-wheel independent drive (4WIS4WID) wall-climbing mobile robot, based on the fusion of multimodal measurements, including wheel odometry, visual odometry, and inertial measurement unit (IMU) data, using an Extended Kalman Filter (EKF) and an Unscented Kalman Filter (UKF). The pose estimator is a critical component of wall-climbing mobile robots, as they carry precision measurement equipment and maintenance tools on construction sites and must know their pose on the building at the time of measurement. Due to the complex geometry and material properties of building façades, the use of traditional localization sensors such as laser, ultrasonic, or radar is often infeasible for wall-climbing robots. Moreover, GPS-based localization is generally unreliable in these environments because of signal degradation caused by reinforced concrete and electromagnetic interference. Consequently, robot odometry remains the primary source of velocity and position information, despite being susceptible to drift caused by both systematic and non-systematic errors. The robot's systematic parameters were calibrated using nonlinear optimization with the Gauss-Newton and Levenberg-Marquardt gradient-based model-fitting methods, while a genetic algorithm and particle swarm optimization were used as stochastic methods for kinematic parameter calibration. The calibration methods and pose estimators were validated in detail through experiments on the experimental mobile wall-climbing robot.
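The following sketch shows the generic predict/update structure of a planar EKF pose estimator of the kind described here: a unicycle motion model driven by wheel-odometry velocities, fused with an absolute pose measurement (e.g., from visual odometry). The state, noise models, and measurement model are simplified assumptions for illustration, not the robot-specific filter from the paper.

```python
# Minimal sketch of a planar EKF pose estimator. State x = [x, y, yaw].
# Prediction uses a unicycle model with odometry velocities; the update fuses a
# direct pose measurement. Illustrative structure only, not the paper's filter.
import numpy as np

def wrap(a):                                       # keep angles in (-pi, pi]
    return (a + np.pi) % (2 * np.pi) - np.pi

def ekf_predict(x, P, v, w, dt, Q):
    """Propagate the state with linear/angular velocities (v, w) over dt."""
    th = x[2]
    x_pred = x + np.array([v * np.cos(th) * dt, v * np.sin(th) * dt, w * dt])
    x_pred[2] = wrap(x_pred[2])
    F = np.array([[1.0, 0.0, -v * np.sin(th) * dt],
                  [0.0, 1.0,  v * np.cos(th) * dt],
                  [0.0, 0.0,  1.0]])
    return x_pred, F @ P @ F.T + Q

def ekf_update(x, P, z, Rm):
    """Fuse a direct pose measurement z = [x, y, yaw] with covariance Rm."""
    H = np.eye(3)
    y = z - H @ x
    y[2] = wrap(y[2])
    S = H @ P @ H.T + Rm
    K = P @ H.T @ np.linalg.inv(S)
    x_new = x + K @ y
    x_new[2] = wrap(x_new[2])
    return x_new, (np.eye(3) - K @ H) @ P
```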



DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model

Azhari, Maulana Bisyir, Shim, David Hyunchul

arXiv.org Artificial Intelligence

Learning-based monocular visual odometry (VO) faces robustness, generalization, and efficiency challenges in robotics. Recent advances in visual foundation models, such as DINOv2, have improved robustness and generalization in various vision tasks, yet their integration in VO remains limited due to coarse feature granularity. In this paper, we present DINO-VO, a feature-based VO system leveraging the DINOv2 visual foundation model for its sparse feature matching. To address the integration challenge, we propose a salient keypoints detector tailored to DINOv2's coarse features. Furthermore, we complement DINOv2's robust semantic features with fine-grained geometric features, resulting in more localizable representations. Finally, a transformer-based matcher and a differentiable pose estimation layer enable precise camera motion estimation by learning good matches. Against prior detector-descriptor networks like SuperPoint, DINO-VO demonstrates greater robustness in challenging environments. Furthermore, we show superior accuracy and generalization of the proposed feature descriptors against standalone DINOv2 coarse features. DINO-VO outperforms prior frame-to-frame VO methods on the TartanAir and KITTI datasets and is competitive on the EuRoC dataset, while running efficiently at 72 FPS with less than 1 GB of memory usage on a single GPU. Moreover, it performs competitively against Visual SLAM systems on outdoor driving scenarios, showcasing its generalization capabilities.
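To illustrate one generic way a salient-keypoint detector can operate on coarse features (e.g., DINOv2 patch tokens reshaped to their patch grid), the sketch below scores grid cells by feature norm and keeps local maxima. This heuristic and the patch-size parameter are assumptions for illustration, not the detector proposed in DINO-VO.

```python
# Minimal sketch: pick "salient" keypoints on a coarse H x W x C feature grid by
# taking local maxima of a simple feature-norm saliency map, then map the winning
# cells back to approximate pixel coordinates. Illustrative heuristic only.
import numpy as np

def salient_keypoints(feat_grid, top_k=200, patch=14):
    """feat_grid: (H, W, C) coarse features; returns (top_k, 2) pixel coordinates."""
    H, W, _ = feat_grid.shape
    saliency = np.linalg.norm(feat_grid, axis=-1)          # (H, W) response map
    # 3x3 non-maximum suppression over the grid
    padded = np.pad(saliency, 1, mode="edge")
    is_max = np.ones_like(saliency, dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            is_max &= saliency >= padded[1 + dy:1 + dy + H, 1 + dx:1 + dx + W]
    ys, xs = np.nonzero(is_max)
    order = np.argsort(-saliency[ys, xs])[:top_k]           # strongest responses first
    ys, xs = ys[order], xs[order]
    # map cell centers back to image pixels (patch = assumed ViT patch size in pixels)
    return np.stack([xs * patch + patch / 2, ys * patch + patch / 2], axis=1)
```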


An Online Adaptation Method for Robust Depth Estimation and Visual Odometry in the Open World

Ji, Xingwu, Niu, Haochen, Duan, Dexin, Ying, Rendong, Wen, Fei, Liu, Peilin

arXiv.org Artificial Intelligence

Recently, learning-based robotic navigation systems have gained extensive research attention and made significant progress. However, the diversity of open-world scenarios poses a major challenge for the generalization of such systems to practical scenarios. Specifically, learned systems for scene measurement and state estimation tend to degrade when the application scenarios deviate from the training data, resulting in unreliable depth and pose estimation. Toward addressing this problem, this work aims to develop a visual odometry system that can quickly adapt to diverse novel environments in an online manner. To this end, we construct a self-supervised online adaptation framework for monocular visual odometry aided by an online-updated depth estimation module. Firstly, we design a monocular depth estimation network with lightweight refiner modules, which enables efficient online adaptation. Then, we construct an objective for self-supervised learning of the depth estimation module based on the output of the visual odometry system and the contextual semantic information of the scene. Specifically, a sparse depth densification module and a dynamic consistency enhancement module are proposed to leverage camera poses and contextual semantics to generate pseudo-depths and valid masks for the online adaptation. Finally, we demonstrate the robustness and generalization capability of the proposed method in comparison with state-of-the-art learning-based approaches on urban and in-house datasets and a robot platform. Code is publicly available at: https://github.com/jixingwu/SOL-SLAM.
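A minimal sketch of the kind of self-supervision signal involved is given below: a masked loss between the predicted dense depth and sparse pseudo-depth (e.g., triangulated VO landmarks projected into the image), evaluated only at valid pixels. The paper's actual objective additionally uses depth densification and dynamic-consistency terms, so this is only an assumption-laden illustration of the general idea.

```python
# Minimal sketch: masked L1 loss between a predicted dense depth map and sparse
# pseudo-depth, the basic self-supervision signal for online depth adaptation.
# Illustrative only; the paper's objective is richer than this single term.
import torch

def sparse_depth_loss(pred_depth, pseudo_depth, valid_mask):
    """pred_depth, pseudo_depth: (B, 1, H, W) tensors; valid_mask: (B, 1, H, W) bool.
    The loss is averaged only over pixels where pseudo-depth is available."""
    diff = torch.abs(pred_depth - pseudo_depth)
    n_valid = valid_mask.sum().clamp(min=1)
    return (diff * valid_mask).sum() / n_valid
```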


SeeTree -- A modular, open-source system for tree detection and orchard localization

Brown, Jostan, Grimm, Cindy, Davidson, Joseph R.

arXiv.org Artificial Intelligence

Accurate localization is an important functional requirement for precision orchard management. However, there are few off-the-shelf commercial solutions available to growers. In this paper, we present SeeTree, a modular, open-source embedded system for tree trunk detection and orchard localization that is deployable on any vehicle. Building on our prior work on vision-based in-row localization using particle filters, SeeTree includes several new capabilities. First, it supports full orchard localization, including out-of-row headland turning. Second, it includes the flexibility to integrate either visual, GNSS, or wheel odometry in the motion model. During field experiments in a commercial orchard, the system converged to the correct location 99% of the time over 800 trials, even when starting with large uncertainty in the initial particle locations. When turning out of row, the system correctly tracked 99% of the turns (860 trials representing 43 unique row changes). To help support adoption and future research and development, we make our dataset, design files, and source code freely available to the community.
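For readers unfamiliar with the underlying machinery, the sketch below shows one generic particle-filter localization step (odometry propagation, measurement re-weighting, resampling). The likelihood function is a placeholder, and SeeTree's trunk-detection measurement model, map handling, and headland-turn logic are not represented here.

```python
# Minimal sketch of one particle-filter localization step: propagate particles with
# noisy odometry, re-weight them with a measurement likelihood (placeholder), and
# resample when the effective sample size collapses. Generic structure only.
import numpy as np

rng = np.random.default_rng(0)

def pf_step(particles, weights, odom, odom_noise, likelihood_fn, measurement):
    """particles: (N, 3) poses [x, y, yaw]; odom: (dx, dy, dyaw) in the robot frame;
    odom_noise: per-axis noise std; likelihood_fn(particles, measurement) -> (N,) scores."""
    N = len(particles)
    # 1) motion update: apply odometry in each particle's frame, with added noise
    noise = rng.normal(0.0, odom_noise, size=(N, 3))
    dx, dy, dyaw = (np.asarray(odom) + noise).T
    c, s = np.cos(particles[:, 2]), np.sin(particles[:, 2])
    particles[:, 0] += c * dx - s * dy
    particles[:, 1] += s * dx + c * dy
    particles[:, 2] += dyaw
    # 2) measurement update: score each particle against the current detections
    weights = weights * likelihood_fn(particles, measurement)
    weights = weights / max(weights.sum(), 1e-12)
    # 3) systematic resampling when the effective sample size drops below N/2
    if 1.0 / np.sum(weights ** 2) < N / 2:
        idx = np.searchsorted(np.cumsum(weights), (rng.random() + np.arange(N)) / N)
        particles, weights = particles[idx], np.full(N, 1.0 / N)
    return particles, weights
```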